05 - Taming Text


In [1]:
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.sentiment import *
import pandas as pd
import numpy as np
import nltk
import time
import matplotlib.pyplot as plt
import seaborn as sns
import pycountry
%matplotlib inline



In [2]:
# import data
directory = 'hillary-clinton-emails/'
aliases = pd.read_csv(directory+'aliases.csv')
email_receivers = pd.read_csv(directory+'EmailReceivers.csv')
emails = pd.read_csv(directory+'Emails.csv')
persons = pd.read_csv(directory+'Persons.csv')

Comparison between extracted body text and raw text


In [3]:
i = 2
print(emails['ExtractedBodyText'][i], '\n\n END OF BODY TEXT \n\n', emails['RawText'][i])


Thx 

 END OF BODY TEXT 

 UNCLASSIFIED
U.S. Department of State
Case No. F-2015-04841
Doc No. C05739547
Date: 05/14/2015
STATE DEPT. - PRODUCED TO HOUSE SELECT BENGHAZI COMM.
SUBJECT TO AGREEMENT ON SENSITIVE INFORMATION & REDACTIONS. NO FOIA WAIVER.
RELEASE IN
PART B6
From: Mills, Cheryl D <MillsCD@state.gov>
Sent: Wednesday, September 12, 2012 11:52 AM
To: B6
Cc: Abedin, Huma
Subject: Re: Chris Stevens
Thx
Original Message
From: Anne-Marie Slaughter [
Sent: Wednesday, September 12, 2012 07:46 AM
To: Ihdr22@clintonernail.com' <hdr22@clintonemail corn>
Cc: Abed in, Huma; Mills, Cheryl D
Subject: Chris Stevens
To you and all my former colleagues, I'm so terribly sorry. Our FSOs stand on the front lines just as surely and devotedly
as our soldiers do. Thinking of you and Pat and everyone this morning.
AM
UNCLASSIFIED
U.S. Department of State
Case No. F-2015-04841
Doc No. C05739547
Date: 05/14/2015
STATE DEPT. - PRODUCED TO HOUSE SELECT BENGHAZI COMM.
SUBJECT TO AGREEMENT ON SENSITIVE INFORMATION & REDACTIONS. NO FOIA WAIVER. STATE-5CB0045248


By reading a few emails we can see that the extracted body text contains only the text that the email sender wrote (as stated on Kaggle), while the raw text also gathers the forwarded emails or the whole discussion thread. Note that the extracted body text can sometimes be NaN. Because the raw text repeats quoted messages, using it would bias the word distribution, so we kept only the body text.
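
As a quick sanity check of the NaN claim (a minimal sketch over the emails DataFrame loaded above):

In [ ]:
# how many emails lack an extracted body?
n_missing = emails['ExtractedBodyText'].isnull().sum()
print('{} of {} emails have no extracted body text'.format(n_missing, len(emails)))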

1. Word clouds


In [4]:
# raw corpus
text_corpus = emails.ExtractedBodyText.dropna().values
raw_text = ' '.join(text_corpus)

# generate wordcloud
wordcloud = WordCloud().generate(raw_text)
plt.figure(figsize=(15,10))
plt.imshow(wordcloud)
plt.axis('off');



In [8]:
def preprocess(text, stemmer):
    print('Length of raw text: ', len(text))
    
    # tokenization (need to install models/punkt from nltk.download())
    tokens = nltk.word_tokenize(text, language='english')
    print('Number of tokens extracted: ', len(tokens))
    
    # stopword removal (need to install the stopwords corpus in corpora/stopwords)
    # cache stopwords in a set to improve performance (70x speedup)
    cached_stopwords = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in cached_stopwords]
    print('Number of tokens after stopword removal: ', len(filtered_tokens))
    
    # stemming
    if stemmer == 'snowball':
        stemmer = nltk.SnowballStemmer('english')
    elif stemmer == 'porter':
        # PorterStemmer is English-only and takes no language argument
        stemmer = nltk.PorterStemmer()
    else:
        raise ValueError('choose an appropriate stemmer: snowball or porter')
    stemmed_filtered_tokens = [stemmer.stem(t) for t in filtered_tokens]
    
    # dump the result in a text file
    output = ' '.join(stemmed_filtered_tokens)
    with open("preprocessed_text.txt", "w") as text_file:
        text_file.write(output)
        
preprocess(raw_text, 'snowball')


Length of raw text:  3601322
Number of tokens extracted:  697009
Number of tokens after stopword removal:  475544

In [9]:
preprocessed_text = open('preprocessed_text.txt').read()
wordcloud2 = WordCloud().generate(preprocessed_text)
plt.figure(figsize=(15,10))
plt.imshow(wordcloud2)
plt.axis('off');


Comparison between the word clouds

Looking at the word cloud generated after preprocessing the data, it seems that stemming hurt the "performance" of the word cloud: a number of words have been truncated by the stemmer, e.g. department has been reduced to depart, secretary to secretari, message to messag, and so on.
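
This over-stemming is easy to reproduce with the same Snowball stemmer used in preprocess:

In [ ]:
# reproduce the truncations visible in the word cloud
stemmer = nltk.SnowballStemmer('english')
for word in ['department', 'secretary', 'message']:
    print(word, '->', stemmer.stem(word))
# prints: department -> depart, secretary -> secretari, message -> messag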

2. Sentiment analysis

From this link, we added the following words to be removed from the emails:

  • RE is ISO2 for Réunion, but it appears a lot in emails to mark RE(sponses) to previous emails
  • FM is ISO2 for Micronesia but commonly abbreviates "From"
  • TV is ISO2 for Tuvalu but also refers to television
  • AL is a given name and also ISO2 for Albania
  • BEN is a given name and also ISO3 for Benin
  • LA is Los Angeles and ISO2 for Laos
  • AQ is an abbreviation of "as quoted" and ISO2 for Antarctica

After a few runs, we looked at the (unusual) countries extracted. For example, Saint Pierre and Miquelon is mentioned 631 times, not bad for such a small country. We noticed that a substantial number of emails contain capitalized words that are misinterpreted as ISO2/ISO3 country codes (see the quick check after this list). To cope with this we added the following stop words:

  • AND is ISO3 for Andorra
  • AM is ISO2 for Armenia
  • AT is ISO2 for Austria
  • IN is ISO2 for India
  • NO is ISO2 for Norway
  • PM is ISO2 for Saint Pierre and Miquelon
  • TO is ISO2 for Tonga
  • BY is ISO2 for Belarus
  • IE is ISO2 for Ireland but also abbreviates "id est" (i.e.)
  • IT is ISO2 for Italy
  • MS is ISO2 for Montserrat
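
The collision is easy to reproduce with the same pycountry lookup used below:

In [ ]:
# capitalized English words that are also valid alpha_2 codes
for token in ['PM', 'IN', 'TO']:
    print(token, '->', pycountry.countries.get(alpha_2=token).name)
# prints: PM -> Saint Pierre and Miquelon, IN -> India, TO -> Tonga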

In [11]:
def find_countries(tokens):
    # find countries in a list of tokens
    countries = []
    for token in tokens:
        try:
            # search for an alpha_2 country code, e.g. US, CH
            country = pycountry.countries.get(alpha_2=token)
            countries.append(country.name)            
        except KeyError:
            try:
                # search for an alpha_3 country code, e.g. USA, CHE
                country = pycountry.countries.get(alpha_3=token)
                countries.append(country.name)
            except KeyError:
                try:
                    # search for a country by its name; title() upper-cases every first letter
                    # and lower-cases the rest, hence it is handled last, but it deals with
                    # country names written in lower case
                    country = pycountry.countries.get(name=token.title())
                    countries.append(country.name)
                except KeyError:
                    pass
    return list(set(countries))
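
Note that find_countries relies on pycountry.countries.get raising a KeyError on a miss, which matches the pycountry release used here; newer releases return None instead. A version-tolerant variant could look like the sketch below (an illustration, not the lookup used in the rest of the notebook):

In [ ]:
def find_countries_tolerant(tokens):
    # works whether pycountry's get() raises KeyError or returns None on a miss
    countries = set()
    for token in tokens:
        for query in (dict(alpha_2=token), dict(alpha_3=token), dict(name=token.title())):
            try:
                country = pycountry.countries.get(**query)
            except KeyError:
                country = None
            if country is not None:
                countries.add(country.name)
                break
    return list(countries)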

In [23]:
def foreign_policy(emails, sentiment_analyzer):
    start_time = time.time()
    words_to_be_removed = ["RE", "FM", "TV", "LA", "AL", "BEN", "AQ", "AND", "AM", "AT", "IN", "NO", "PM", "TO",
                           "BY", "IE", "IT", "MS"]
    vader_analyzer = SentimentIntensityAnalyzer()
    foreign_policy = {}
    cached_stopwords = set(stopwords.words('english'))
    cached_stopwords.update(words_to_be_removed)

    for email in emails: # TODO: use a regex instead of a token lookup because this takes too long
        tokens = nltk.word_tokenize(email, language='english')
        tokens = [word for word in tokens if word not in cached_stopwords]
        # country lookup in tokens
        countries = find_countries(tokens)
        if not countries: continue
        
        if sentiment_analyzer == 'vader':
            sentiment = vader_analyzer.polarity_scores(email)
            score = sentiment['compound']
        # other sentiment analyzers could be plugged in here
              
        for country in countries:
            if country not in foreign_policy:
                foreign_policy[country] = [score, 1]
            else:
                foreign_policy[country][0] += score
                foreign_policy[country][1] += 1
    # replace each accumulated score by its mean over the mention count
    for country, value in foreign_policy.items():
        foreign_policy[country] = [value[0] / value[1], value[1]]
    print("--- %s seconds elapsed ---" % (time.time() - start_time))
    return foreign_policy
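
The TODO above hints at the planned speedup: rather than looking up every token in pycountry, precompile a single alternation over all codes and names and scan each email once. A rough sketch of that idea (an untested assumption, not the code that produced the results below):

In [ ]:
import re

# map every alpha_2 code, alpha_3 code and country name to a canonical name
code_to_name = {}
for c in pycountry.countries:
    code_to_name[c.alpha_2] = c.name
    code_to_name[c.alpha_3] = c.name
    code_to_name[c.name] = c.name

# longest alternatives first so full names win over embedded codes
country_pattern = re.compile(r'\b(' + '|'.join(
    re.escape(key) for key in sorted(code_to_name, key=len, reverse=True)) + r')\b')

def find_countries_regex(text):
    # one regex pass per email instead of one pycountry lookup per token
    return list({code_to_name[match] for match in country_pattern.findall(text)})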

In [40]:
result = foreign_policy(text_corpus, sentiment_analyzer='vader')


--- 1501.872615814209 seconds elapsed ---

In [42]:
result


Out[42]:
{'Afghanistan': [0.4165027777777779, 108],
 'Albania': [0.03340000000000004, 2],
 'Algeria': [0.32343333333333335, 3],
 'American Samoa': [0.526425, 4],
 'Angola': [0.4704666666666667, 9],
 'Anguilla': [0.4939, 1],
 'Antarctica': [0.8316, 1],
 'Argentina': [0.48769230769230776, 13],
 'Armenia': [0.821925, 4],
 'Australia': [0.539875, 16],
 'Austria': [0.034600000000000034, 5],
 'Azerbaijan': [0.41691428571428574, 7],
 'Bahamas': [0.9975, 2],
 'Bahrain': [0.3338666666666667, 3],
 'Bangladesh': [0.26303333333333334, 6],
 'Barbados': [0.87245, 2],
 'Belarus': [0.34, 1],
 'Belgium': [0.19532222222222226, 9],
 'Bermuda': [-0.6346333333333333, 3],
 'Bosnia and Herzegovina': [0.9399, 1],
 'Brazil': [0.414296, 25],
 'British Indian Ocean Territory': [0.9835, 1],
 'Brunei Darussalam': [-0.08082500000000001, 4],
 'Burundi': [0.9985, 1],
 'Cabo Verde': [0.6441666666666667, 3],
 'Cambodia': [0.49875, 2],
 'Cameroon': [0.396075, 4],
 'Canada': [0.35501363636363636, 22],
 'Cayman Islands': [0.9992, 1],
 'Chad': [-0.9936, 1],
 'Chile': [0.3441916666666667, 12],
 'China': [0.4167344444444446, 90],
 'Cocos (Keeling) Islands': [0.85405, 2],
 'Colombia': [0.42876428571428565, 14],
 'Comoros': [0.5904, 1],
 'Congo': [0.11362727272727273, 11],
 'Congo, The Democratic Republic of the': [0.9921666666666668, 3],
 'Costa Rica': [0.19644999999999999, 4],
 'Croatia': [-0.2650333333333333, 3],
 'Cuba': [0.345645, 20],
 'Cyprus': [0.9598999999999999, 3],
 'Czechia': [0.9861, 1],
 'Denmark': [0.255075, 4],
 'Dominican Republic': [0.4404, 1],
 'Ecuador': [0.60605, 8],
 'Egypt': [0.388384375, 32],
 'Estonia': [0.439625, 12],
 'Ethiopia': [0.40805, 8],
 'Faroe Islands': [0.9327, 1],
 'Finland': [0.15746666666666667, 3],
 'France': [0.03322307692307692, 26],
 'Gabon': [0.37496666666666667, 6],
 'Gambia': [0.406525, 4],
 'Georgia': [0.8871333333333333, 9],
 'Germany': [0.029796296296296296, 27],
 'Ghana': [0.4003333333333334, 3],
 'Gibraltar': [-0.005199999999999982, 2],
 'Greece': [0.3252555555555555, 9],
 'Greenland': [0.8495999999999999, 2],
 'Guam': [0.0, 1],
 'Guatemala': [0.15386666666666668, 3],
 'Guinea': [0.4105666666666666, 3],
 'Guinea-Bissau': [-0.43720000000000003, 2],
 'Guyana': [0.9565, 1],
 'Haiti': [0.41891012658227844, 79],
 'Holy See (Vatican City State)': [0.6319571428571429, 7],
 'Honduras': [0.2891851851851852, 27],
 'Hong Kong': [0.7964, 1],
 'Hungary': [0.2096, 1],
 'Iceland': [0.27065000000000006, 10],
 'India': [0.6457557692307696, 52],
 'Indonesia': [0.511713043478261, 23],
 'Iraq': [0.24864647887323935, 71],
 'Ireland': [0.29438749999999997, 32],
 'Israel': [0.35718202247191005, 89],
 'Italy': [0.49867000000000006, 10],
 'Jamaica': [0.4428272727272727, 11],
 'Japan': [0.5321000000000001, 25],
 'Jersey': [0.6233500000000001, 10],
 'Jordan': [0.30178260869565215, 23],
 'Kazakhstan': [0.11173333333333331, 3],
 'Kenya': [0.5766875, 8],
 'Korea, Republic of': [0.0, 1],
 'Kuwait': [0.2588, 6],
 'Kyrgyzstan': [0.8941, 2],
 'Latvia': [-0.994, 1],
 'Lebanon': [-0.17607999999999996, 5],
 'Lesotho': [-0.7118, 1],
 'Liberia': [0.27976, 5],
 'Libya': [0.018543103448275863, 58],
 'Liechtenstein': [0.9957, 1],
 'Lithuania': [0.82075, 6],
 'Luxembourg': [0.9715, 2],
 'Macao': [0.9992, 1],
 'Madagascar': [-0.0013500000000000179, 2],
 'Malawi': [0.16895, 2],
 'Malaysia': [0.40893999999999997, 5],
 'Maldives': [-0.97695, 2],
 'Malta': [0.8869, 1],
 'Mauritius': [0.8797, 1],
 'Mexico': [0.34602894736842105, 38],
 'Moldova, Republic of': [0.9988, 2],
 'Monaco': [-0.08377499999999999, 4],
 'Mongolia': [0.93915, 2],
 'Montenegro': [0.030075000000000032, 4],
 'Morocco': [0.7178625, 16],
 'Mozambique': [0.7241, 1],
 'Myanmar': [0.71782, 5],
 'Namibia': [-0.4968, 2],
 'Nauru': [0.4999, 2],
 'Nepal': [0.6757, 1],
 'Netherlands': [0.03064285714285713, 7],
 'New Caledonia': [0.58855, 2],
 'New Zealand': [0.6497, 3],
 'Nicaragua': [0.42887142857142857, 14],
 'Niger': [-0.7118, 1],
 'Nigeria': [0.6729666666666666, 6],
 'Northern Mariana Islands': [0.3496666666666667, 9],
 'Norway': [0.3464333333333333, 12],
 'Oman': [0.36, 2],
 'Pakistan': [0.31420985915492966, 71],
 'Palau': [0.6025076923076924, 13],
 'Panama': [0.5432960000000001, 25],
 'Papua New Guinea': [0.2432, 1],
 'Paraguay': [0.0, 1],
 'Peru': [-0.03495000000000002, 4],
 'Philippines': [0.72122, 5],
 'Poland': [-0.16584375, 16],
 'Portugal': [0.42910000000000004, 6],
 'Puerto Rico': [0.1303923076923077, 13],
 'Qatar': [0.48852142857142855, 14],
 'Romania': [0.35132500000000005, 4],
 'Russian Federation': [0.0, 1],
 'Rwanda': [0.48134, 5],
 'Samoa': [0.72265, 2],
 'San Marino': [0.9994, 1],
 'Senegal': [0.9987, 1],
 'Serbia': [0.16387499999999994, 4],
 'Seychelles': [0.9992, 1],
 'Sierra Leone': [0.6709249999999999, 4],
 'Singapore': [0.7850111111111111, 18],
 'Slovakia': [0.9638, 1],
 'Slovenia': [-0.006233333333333313, 3],
 'Somalia': [0.5170416666666667, 12],
 'South Sudan': [-0.39106, 5],
 'Spain': [0.15328571428571441, 14],
 'Sudan': [0.2317857142857143, 14],
 'Suriname': [0.9879, 1],
 'Sweden': [0.42581428571428576, 7],
 'Switzerland': [0.49155000000000004, 4],
 'Tajikistan': [0.9925, 1],
 'Thailand': [0.8663, 4],
 'Timor-Leste': [0.0, 1],
 'Trinidad and Tobago': [0.5809333333333333, 3],
 'Tunisia': [0.2060714285714286, 7],
 'Turkey': [0.3778066666666666, 30],
 'Turkmenistan': [0.99455, 2],
 'Tuvalu': [0.99795, 2],
 'Uganda': [0.13835, 6],
 'Ukraine': [0.6756444444444445, 9],
 'United Kingdom': [0.7806833333333333, 6],
 'United States': [0.3995103092783504, 194],
 'Uruguay': [0.279475, 8],
 'Uzbekistan': [0.9981333333333334, 3],
 'Virgin Islands, U.S.': [0.25460000000000005, 5],
 'Yemen': [-0.005116666666666658, 12],
 'Zambia': [-0.4142666666666666, 3],
 'Zimbabwe': [0.9976, 1]}

In [54]:
pycountry.countries.get(name='Honduras')


Out[54]:
Country(alpha_2='HN', alpha_3='HND', name='Honduras', numeric='340', official_name='Republic of Honduras')

In [80]:
def create_palette(sentiments):
    # map each sentiment to an RGB color on a red (negative) to green (positive) gradient
    color_palette = []
    minimum = np.min(sentiments)
    maximum = np.max(sentiments)
    for sentiment in sentiments:
        # rescale the sentiment to [0, 1]
        rescaled = (sentiment - minimum) / (maximum - minimum)
        g = rescaled
        r = 1 - g
        color_palette.append((r, g, 0))
    return color_palette
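
A quick check of the mapping on three sample scores:

In [ ]:
# the lowest sentiment maps to pure red, the highest to pure green
create_palette([-1.0, 0.0, 1.0])
# returns [(1.0, 0.0, 0), (0.5, 0.5, 0), (0.0, 1.0, 0)]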

Plotting the foreign policy


In [87]:
df = pd.DataFrame.from_dict(result, orient='index')
df.reset_index(inplace=True)
df.columns = ['Country', 'Sentiment', 'Count']
# keep countries mentioned in more than 15 emails
df = df[df['Count'] > 15]
df = df.sort_values('Sentiment', ascending=False)
# bar height encodes the mention count; bar color encodes the mean sentiment
# (red = most negative, green = most positive)
gradient = create_palette(df['Sentiment'].values)
plt.figure(figsize=(15,7))
plot = sns.barplot(x='Country', y='Count', data=df, orient='v', palette=gradient)
plt.xticks(rotation=45)
plt.ylabel('Number of emails mentioning the country');



In [45]:
pycountry.countries.get(name='Palau')


Out[45]:
Country(alpha_2='PW', alpha_3='PLW', name='Palau', numeric='585', official_name='Republic of Palau')

In [26]:
test_sentence = "and here I am AM TO speaking of France"
test_sentence = "This is a typical sentence, with don't. Punkts, something e.g. words US, U.S.A"
cached_stopwords = set(stopwords.words('english'))
words_to_be_removed = ["RE", "FM", "TV", "LA", "AL", "BEN", "AQ", "AND", "AM", "AT"]
cached_stopwords.update(words_to_be_removed)
tokens = nltk.word_tokenize(test_sentence)
#tokens = [word for word in tokens if word not in cached_stopwords]
countries = find_countries(tokens)
print(tokens)


['This', 'is', 'a', 'typical', 'sentence', ',', 'with', 'do', "n't", '.', 'Punkts', ',', 'something', 'e.g', '.', 'words', 'US', ',', 'U.S.A']

In [ ]:
test_sentence = 'This is a very pleasant day.'
#test_sentence = 'this is a completely neutral sentence'
polarity = {'Positive': 1, 'Neutral': 0, 'Negative': -1}
vader_analyzer = SentimentIntensityAnalyzer()
tokens = nltk.word_tokenize(test_sentence)
#tokens.remove('is')
clean_sentence = ' '.join(tokens)  # renamed from `result` to avoid clobbering the foreign_policy output above
sentiment = vader_analyzer.polarity_scores(clean_sentence)
#mean = -sentiment['neg'] + sentiment['pos']
#print(sentiment, mean)
max(sentiment.values())  # the builtin max works on a dict view; np.max does not

In [ ]:
test_set = ['nice nice good USA US switzerland', 'bad good bad bad bad libya', 'Switzerland good nice nice']
words_to_be_removed = ["RE", "FM", "TV", "LA", "AL", "BEN", "AQ"]
vader_analyzer = SentimentIntensityAnalyzer()
foreign_policy = {}
for email in test_set:
    tokens = nltk.word_tokenize(email, language='english')
    tokens = [word for word in tokens if word not in words_to_be_removed]
    clean_email = ' '.join(tokens)
    sentiment = vader_analyzer.polarity_scores(clean_email)
    score = sentiment['compound']
    # country lookup in the filtered tokens
    countries = find_countries(tokens)
    for country in countries:
        if country not in foreign_policy:
            foreign_policy[country] = [score, 1]
        else:
            foreign_policy[country][0] += score
            foreign_policy[country][1] += 1

# replace each accumulated score by its mean over the mention count
for country, value in foreign_policy.items():
    foreign_policy[country] = [value[0] / value[1], value[1]]